Efficiently Identifying Interesting Time Points in Text Archives
Authors
Abstract
Large-scale text archives are increasingly becoming available on the Web. Exploring their evolving contents along both the text and temporal dimensions enables us to realize their full potential. Standard keyword queries facilitate exploration along the text dimension only. Recently proposed time-travel keyword queries enable query processing along both dimensions, but require the user to know the exact time point of interest. This may be impractical if the user does not know the history of the query within the collection or is not familiar with the topic. In this work, our aim is to efficiently identify interesting time points in Web archives, assuming that we receive a result list for a given query in standard relevance order from an existing retrieval system. We consider two forms of Web archives: (i) one where documents have a publication timestamp and never change (such as news archives), and (ii) one where documents undergo revisions and are thus versioned. In both settings, we define interestingness as the change in the top-k result set between two consecutive time points. The key step in our solution is maintaining the top-k results valid at each time point of the archive, which can then be used to compute the interestingness scores for the time points. We propose two techniques for efficiently identifying interesting time points: (i) for the case where documents, once published, never change, a simple but effective technique; and (ii) for the more general case with versioned documents, an extension to the segment tree that makes it rank-aware and dynamic. To further improve efficiency, we propose an early termination technique, which proves to be very effective. Experiments on the New York Times news archive and the Wikipedia versioned archive show that our methods find interesting time points efficiently.
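The setting with unchanging documents can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name interesting_time_points, the input layout (a relevance-ordered list of (doc_id, publish_time) pairs), and the scoring choice (the number of documents that newly enter the top-k at a time point) are all assumptions made for illustration, since the abstract does not give the exact measure.

```python
import heapq
from itertools import groupby


def interesting_time_points(results, k):
    """Score each publication time by how much the top-k set changes there.

    results: list of (doc_id, publish_time) in relevance order, best first
             (assumed input format; documents never change once published).
    Returns a list of (time, score), where score is the number of documents
    in the top-k at that time that were not in the top-k just before it
    (an assumed interestingness measure).
    """
    # A document's rank is its position in the relevance-ordered list.
    by_time = sorted((t, rank, doc) for rank, (doc, t) in enumerate(results))

    heap = []    # max-heap on rank (stored negated): the current top-k
    scores = []
    for t, group in groupby(by_time, key=lambda item: item[0]):
        before = {doc for _, doc in heap}
        for _, rank, doc in group:
            if len(heap) < k:
                heapq.heappush(heap, (-rank, doc))
            elif rank < -heap[0][0]:
                # New document outranks the worst current top-k member.
                heapq.heapreplace(heap, (-rank, doc))
        after = {doc for _, doc in heap}
        scores.append((t, len(after - before)))
    return scores


if __name__ == "__main__":
    ranked = [("a", 3), ("b", 1), ("c", 2), ("d", 1), ("e", 4)]
    for time_point, score in interesting_time_points(ranked, k=2):
        print(time_point, score)
```

Because documents are only ever added, the top-k set can change only when a newly published document outranks the current k-th result, so one pass over the publication times suffices. The versioned case, where documents can also leave the result set, is not covered by this sketch.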
Similar resources
InZeit: Efficiently Identifying Insightful Time Points
Web archives are useful resources to find out about the temporal evolution of persons, organizations, products, or other topics. However, even when advanced text search functionality is available, gaining insights into the temporal evolution of a topic can be a tedious task and often requires sifting through many documents. The demonstrated system named INZEIT (pronounced “insight”) assists use...
Indexing methods for web archives
There have been numerous recent efforts to digitize previously published content and preserve born-digital content, leading to the widespread growth of large text repositories. Web archives are such continuously growing text collections, which contain versions of documents spanning long time periods. Web archives present many opportunities for historical, cultural and political analyses....
Text Summarization in Data Mining
Text summarizers automatically construct summaries of a natural-language document. This paper examines the use of text summarization within data mining, identifying the potential summarizers have for uncovering interesting and unexpected information. It describes the current state of the art in commercial summarization and current approaches to the evaluation of summarizers. The paper then propo...
Adapting predominant and novel sense discovery algorithms for identifying corpus-specific sense differences
Word senses are not static and may have temporal, spatial or corpus-specific scopes. Identifying such scopes might greatly benefit existing WSD systems. In this paper, while studying corpus-specific word senses, we adapt three existing predominant and novel-sense discovery algorithms to identify these corpus-specific senses. We make use of text data available in the form of millions of digi...
Work Motivation: A Study on Regular and Part-time Employees of Bangladesh
Nowadays both part-time and regular employees work in many organizations in Bangladesh. Although many studies have examined the motivation status of regular employees, no study was found that addresses the motivation status of both regular and part-time employees of Bangladesh. Thus, this study is conducted on 300 regular and part-time employees of Bangladesh to know th...